Research Problem

Last year, I read a paper titled "Feature Selection Methods for Identifying Genetic Determinants of Host Species in RNA Viruses". This year, I read another paper titled "Predicting host tropism of influenza A virus proteins using random forest". The essence of both papers was to predict influenza virus host tropism from sequence features. Their feature engineering steps were somewhat distinct: the former encoded the amino acid sequences as binary 1/0s, while the latter used physicochemical characteristics of the amino acids instead. However, the core problem was essentially identical - predict a host classification from influenza protein sequence features. Both papers used random forest classifiers, which are a powerful method for identifying non-linear mappings from features to class labels. My question here was to see whether I could get comparable performance using a simple neural network.

Data

I downloaded influenza HA sequences, dated from 1980 to 2015, from the Influenza Research Database. Lab strains were excluded; duplicates were allowed, since they capture the host tropism of certain sequences. All viral subtypes were included.

Below, let's take a deep dive into what it takes to construct an artificial neural network!

The imports necessary for running this notebook.


In [1]:
import pandas as pd
import numpy as np
from Bio import SeqIO
from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
from collections import Counter
from sklearn.preprocessing import LabelBinarizer
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer scikit-learn versions
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.metrics import mutual_info_score as mi

Read in the viral sequences.


In [2]:
sequences = SeqIO.to_dict(SeqIO.parse('20150902_nnet_ha.fasta', 'fasta'))
# sequences

The sequences are going to be of variable length. To avoid having to do a multiple sequence alignment, filter to just the most common length (i.e. 566 amino acids).


In [3]:
lengths = Counter()
for accession, seqrecord in sequences.items():
    lengths[len(seqrecord.seq)] += 1
    
lengths.most_common(1)[0][0]


Out[3]:
566

Some sequences are ambiguously labeled - for example, the "Environment" and "Avian" samples. We would like to give a more detailed prediction as to which host each likely came from. Therefore, we will set the "Environment" and "Avian" samples aside (a few cells below) and later use the trained model to predict their hosts.


In [74]:
# For convenience, we will only work with amino acid sequences of length 566.
final_sequences = dict()
for accession, seqrecord in sequences.items():
    if len(seqrecord.seq) == lengths.most_common(1)[0][0]:
        final_sequences[accession] = seqrecord

Create a numpy array to store the alignment.


In [72]:
alignment = MultipleSeqAlignment(final_sequences.values())
alignment = np.array([list(rec) for rec in alignment], np.character)

The first meaty piece of code begins here. In the cell below, we convert the sequence matrix into binary 1s and 0s, encoding the features as numbers. This is important - AFAIK, almost all machine learning algorithms require numerical inputs.
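To build intuition first, here's a minimal, self-contained sketch (with made-up letters, not data from this notebook) of what LabelBinarizer does to a single alignment column:

from sklearn.preprocessing import LabelBinarizer

toy_column = ['A', 'C', 'A', 'D', 'C']
lb = LabelBinarizer()
lb.fit(toy_column)
print(lb.classes_)  # ['A' 'C' 'D']
print(lb.transform(toy_column))
# [[1 0 0]
#  [0 1 0]
#  [1 0 0]
#  [0 0 1]
#  [0 1 0]]

Each distinct letter observed at a position becomes its own binary column, which is exactly how 566 positions expand into thousands of features below.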


In [73]:
# Create an empty dataframe.
df = pd.DataFrame()

# Create a dictionary of position + label binarizer objects.
pos_lb = dict()

for pos in range(lengths.most_common(1)[0][0]):
    # Binarize the amino acid letters observed at this position.
    lb = LabelBinarizer()
    # Fit to the alignment at that position.
    lb.fit(alignment[:, pos])
    # Add the label binarizer to the dictionary.
    pos_lb[pos] = lb
    # Create a dataframe of binarized features for this position.
    pos_df = pd.DataFrame(lb.transform(alignment[:, pos]))

    # Append the columns to the dataframe.
    for col in pos_df.columns:
        maxcol = len(df.columns)
        df[maxcol + 1] = pos_df[col]

df['host'] = [s.id.split('|')[1] for s in final_sequences.values()]
df
df


Out[73]:
1 2 3 4 5 6 7 8 9 10 ... 6742 6743 6744 6745 6746 6747 6748 6749 6750 host
0 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
1 0 0 0 1 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 Swine
2 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
3 0 0 0 1 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 Swine
4 0 0 0 1 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 Swine
5 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
6 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Swine
7 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
8 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
9 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Swine
10 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Quail
11 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Duck
12 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
13 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
14 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Swine
15 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
16 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
17 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
18 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Swine
19 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Swine
20 0 0 0 1 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 Swine
21 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
22 0 0 0 0 0 1 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 Duck
23 0 0 0 1 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 Mallard
24 0 0 0 1 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 Chicken
25 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Swine
26 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
27 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28 0 0 0 0 0 1 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 Duck
29 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
28196 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28197 0 0 0 1 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 Mallard
28198 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Mallard
28199 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28200 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28201 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28202 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28203 0 0 0 0 0 1 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 Duck
28204 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28205 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28206 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28207 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28208 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28209 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Blue_Winged_Teal
28210 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28211 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28212 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Pacific_Golden_Plover
28213 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28214 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28215 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Mallard
28216 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28217 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Ruddy_Turnstone
28218 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28219 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28220 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Swine
28221 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Swine
28222 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28223 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28224 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Human
28225 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 Swine

28226 rows × 6751 columns


In [84]:
ambiguous_hosts = ['Environment', 'Avian', 'Unknown', 'NA', 'Bird', 'Sea_Mammal', 'Aquatic_Bird']

# Set aside the ambiguously-labeled samples; we will predict their hosts later.
unknowns = df[df['host'].isin(ambiguous_hosts)]

# Keep the unambiguously-labeled samples for training and testing.
train_test_df = df[~df['host'].isin(ambiguous_hosts)]

With the cells above, we now have a sequence feature matrix, in which the 566 amino acid positions have been expanded to 6750 columns of binary sequence features, and the ambiguously-labeled samples have been split off from the training/testing data.
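As a quick sanity check (a minimal sketch; the exact feature count depends on the residue diversity observed at each position of this particular alignment):

# 6750 binary feature columns plus the one 'host' column.
assert df.shape[1] == 6750 + 1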

The next step is to grab the host species labels and encode them as 1s and 0s as well.


In [85]:
# Grab out the labels.
output_lb = LabelBinarizer()
output_lb.fit(train_test_df['host'])
Y = output_lb.transform(train_test_df['host'])
Y = Y.astype(np.float32)  # Necessary for passing the data into nolearn.
Y.shape


Out[85]:
(28071, 97)

In [86]:
X = train_test_df.ix[:,:-1].values
X = X.astype(np.float32)  # Necessary for passing the data into nolearn.
X.shape


Out[86]:
(28071, 6750)

Next up, we do the train/test split.


In [87]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

For comparison, let's train a random forest classifier, and see what the concordance is between the predicted labels and the actual labels.


In [88]:
rf = RandomForestClassifier()
rf.fit(X_train, Y_train)
predictions = rf.predict(X_test)
predicted_labels = output_lb.inverse_transform(predictions)
# Compute the mutual information between the predicted labels and the actual labels.
mi(predicted_labels, output_lb.inverse_transform(Y_test))


Out[88]:
0.84514995270740723

Using mutual information as the metric for scoring, things don't look bad at all! As mentioned above, the RandomForestClassifier is a pretty powerful method for finding non-linear patterns between features and class labels.
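To calibrate what these mutual information scores mean, here's a minimal sketch with toy labels (not data from this notebook): identical labelings score at the entropy of the labels, while unrelated labelings score near zero.

from sklearn.metrics import mutual_info_score as mi
import numpy as np

labels = ['Human', 'Swine', 'Duck'] * 200
print(mi(labels, labels))  # ~1.10, i.e. ln(3), the entropy of three equally-frequent labels
print(mi(labels, np.random.permutation(labels)))  # near 0 for unrelated labelings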

Uncomment the cell below if you want to try scikit-learn's ExtraTreesClassifier.


In [89]:
# et = ExtraTreesClassifier()
# et.fit(X_train, Y_train)
# predictions = et.predict(X_test)
# predicted_labels = output_lb.inverse_transform(predictions)
# mi(predicted_labels, output_lb.inverse_transform(Y_test))

As a demonstration of how this model can be used, let's look at the ambiguously labeled sequences, i.e. those from "Environment" and "Avian", to see whether we can make a prediction as to which host each likely came from.


In [90]:
unknown_hosts = unknowns.ix[:,:-1].values

preds = rf.predict(unknown_hosts)
output_lb.inverse_transform(preds)


Out[90]:
array(['American_Black_Duck', 'Mallard', 'Human', 'Blue_Winged_Teal',
       'American_Black_Duck', 'Blue_Winged_Teal', 'Blue_Winged_Teal',
       'Gull', 'Blue_Winged_Teal', 'American_Black_Duck', 'Mallard',
       'Blue_Winged_Teal', 'Duck', 'American_Black_Duck',
       'Blue_Winged_Teal', 'Guineafowl', 'American_Black_Duck', 'Chicken',
       'Blue_Winged_Teal', 'American_Black_Duck', 'Blue_Winged_Teal',
       'Blue_Winged_Teal', 'American_Black_Duck', 'Blue_Winged_Teal',
       'American_Black_Duck', 'American_Black_Duck', 'Blue_Winged_Teal',
       'Green_Winged_Teal', 'Blue_Winged_Teal', 'American_Black_Duck',
       'American_Black_Duck', 'American_Black_Duck', 'American_Black_Duck',
       'American_Black_Duck', 'Gull', 'Blue_Winged_Teal',
       'Blue_Winged_Teal', 'Human', 'Duck', 'Duck', 'Human',
       'American_Black_Duck', 'Duck', 'Blue_Winged_Teal',
       'Blue_Winged_Teal', 'Human', 'Blue_Winged_Teal',
       'American_Black_Duck', 'Blue_Winged_Teal', 'American_Black_Duck',
       'Blue_Winged_Teal', 'Duck', 'Mallard', 'Blue_Winged_Teal',
       'American_Black_Duck', 'Blue_Winged_Teal', 'Duck',
       'Blue_Winged_Teal', 'American_Black_Duck', 'Human', 'Duck',
       'Blue_Winged_Teal', 'Herring_Gull', 'Blue_Winged_Teal',
       'Blue_Winged_Teal', 'Duck', 'Blue_Winged_Teal',
       'American_Black_Duck', 'Blue_Winged_Teal', 'Duck',
       'Blue_Winged_Teal', 'Chicken', 'Human', 'Duck',
       'American_Black_Duck', 'Mallard', 'Blue_Winged_Teal', 'Duck',
       'Human', 'Blue_Winged_Teal', 'Blue_Winged_Teal', 'Blue_Winged_Teal',
       'Blue_Winged_Teal', 'American_Black_Duck', 'Turkey',
       'Blue_Winged_Teal', 'Blue_Winged_Teal', 'Blue_Winged_Teal', 'Human',
       'American_Black_Duck', 'Chicken', 'American_Black_Duck', 'Duck',
       'American_Black_Duck', 'Blue_Winged_Teal', 'Blue_Winged_Teal',
       'Mallard', 'Chicken', 'Blue_Winged_Teal', 'American_Black_Duck',
       'Mallard', 'Mallard', 'American_Black_Duck', 'Human',
       'American_Black_Duck', 'American_Black_Duck', 'Duck',
       'American_Black_Duck', 'Black_Duck', 'Mallard',
       'American_Black_Duck', 'Blue_Winged_Teal', 'Blue_Winged_Teal',
       'American_Black_Duck', 'Blue_Winged_Teal', 'Blue_Winged_Teal',
       'Human', 'Human', 'Blue_Winged_Teal', 'Duck', 'American_Black_Duck',
       'American_Black_Duck', 'American_Black_Duck', 'Herring_Gull',
       'Human', 'American_Black_Duck', 'Blue_Winged_Teal',
       'American_Black_Duck', 'American_Black_Duck', 'Shorebird',
       'American_Black_Duck', 'Blue_Winged_Teal', 'American_Black_Duck',
       'Human', 'American_Black_Duck', 'American_Black_Duck', 'Duck',
       'American_Black_Duck', 'Mallard', 'Blue_Winged_Teal',
       'American_Black_Duck', 'American_Black_Duck', 'American_Black_Duck',
       'Duck', 'Duck', 'American_Black_Duck', 'American_Black_Duck',
       'Blue_Winged_Teal', 'Mallard', 'American_Black_Duck',
       'Blue_Winged_Teal', 'Blue_Winged_Teal', 'Blue_Winged_Teal',
       'Blue_Winged_Teal', 'Human'], 
      dtype='<U27')

Alrighty - we're now ready to try out a neural network! For this attempt, we will use lasagne and nolearn, two packages which have made building neural networks pretty easy. In this segment, I'm not going to show experiments with multiple architectures, activation functions and the like. The goal is to illustrate how easy it is to specify a neural network.

First off, these are the import statements necessary for constructing and testing a neural network.


In [91]:
from lasagne import layers
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import NeuralNet
import theano
theano.config.blas.ldflags = ''  # Optional: BLAS was giving me an error.

The network architecture that we'll try is as follows:

  • 1 input layer, of shape 6750 (i.e. taking in the binary sequence feature columns as data).
  • 1 hidden layer, with 300 units.
  • 1 output layer, with one unit per class label (i.e. Y.shape[1] units).
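As a sanity check on model size, each dense layer has (inputs × units) weights plus one bias per unit. Using the 102 output units that the fitted network below ends up with, this reproduces the parameter count nolearn reports:

hidden_params = 6750 * 300 + 300  # 2,025,300
output_params = 300 * 102 + 102   # 30,702
print(hidden_params + output_params)  # 2056002, matching "2056002 learnable parameters" below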

In [92]:
net1 = NeuralNet(layers=[
        ('input', layers.InputLayer),
        ('hidden', layers.DenseLayer),
        ('output', layers.DenseLayer),
    ],
        # Layer parameters:
        input_shape=(None, X.shape[1]),
        hidden_num_units=300,
        output_nonlinearity=None,
        output_num_units=Y.shape[1],
        #allow_input_downcast=True,

        # Optimization Method:
        update=nesterov_momentum,
        update_learning_rate=0.01,
        update_momentum=0.9,
                
        regression=True,
        max_epochs=100,
        verbose=1
        )

Training even a simple neural network on my MacBook Air takes quite a bit of time :). But the function call for fitting it is a simple net1.fit(X_train, Y_train).


In [52]:
net1.fit(X_train, Y_train)


# Neural Network with 2056002 learnable parameters

## Layer information

  #  name      size
---  ------  ------
  0  input     6750
  1  hidden     300
  2  output     102

  epoch    train loss    valid loss    train/val  dur
-------  ------------  ------------  -----------  -----
      1       0.01125       0.00529      2.12700  8.45s
      2       0.00463       0.00425      1.08914  8.77s
      3       0.00391       0.00375      1.04357  7.77s
      4       0.00353       0.00345      1.02237  8.90s
      5       0.00329       0.00325      1.01102  9.40s
      6       0.00312       0.00310      1.00536  8.61s
      7       0.00299       0.00299      1.00178  7.99s
      8       0.00289       0.00289      0.99940  7.84s
      9       0.00280       0.00281      0.99782  8.08s
     10       0.00273       0.00274      0.99675  9.84s
     11       0.00266       0.00267      0.99616  10.06s
     12       0.00260       0.00261      0.99576  8.84s
     13       0.00255       0.00256      0.99548  8.30s
     14       0.00250       0.00251      0.99497  9.70s
     15       0.00245       0.00247      0.99476  10.68s
     16       0.00241       0.00243      0.99472  9.45s
     17       0.00237       0.00239      0.99481  8.38s
     18       0.00234       0.00235      0.99462  7.74s
     19       0.00230       0.00232      0.99465  8.59s
     20       0.00227       0.00229      0.99452  7.89s
     21       0.00224       0.00226      0.99451  8.39s
     22       0.00222       0.00223      0.99438  8.37s
     23       0.00219       0.00220      0.99433  8.83s
     24       0.00217       0.00218      0.99429  8.29s
     25       0.00215       0.00216      0.99420  8.03s
     26       0.00213       0.00214      0.99412  8.74s
     27       0.00211       0.00212      0.99410  8.49s
     28       0.00209       0.00210      0.99406  8.55s
     29       0.00207       0.00209      0.99396  8.49s
     30       0.00206       0.00207      0.99384  9.09s
     31       0.00204       0.00205      0.99380  7.88s
     32       0.00203       0.00204      0.99375  7.90s
     33       0.00201       0.00203      0.99369  8.87s
     34       0.00200       0.00201      0.99371  8.52s
     35       0.00199       0.00200      0.99364  10.17s
     36       0.00198       0.00199      0.99359  8.75s
     37       0.00196       0.00198      0.99354  11.07s
     38       0.00195       0.00197      0.99348  12.12s
     39       0.00194       0.00196      0.99341  9.84s
     40       0.00193       0.00195      0.99332  8.24s
     41       0.00192       0.00194      0.99326  7.89s
     42       0.00191       0.00193      0.99320  8.63s
     43       0.00190       0.00192      0.99316  8.71s
     44       0.00190       0.00191      0.99309  9.97s
     45       0.00189       0.00190      0.99302  8.96s
     46       0.00188       0.00189      0.99293  9.66s
     47       0.00187       0.00188      0.99284  10.56s
     48       0.00186       0.00188      0.99276  9.30s
     49       0.00186       0.00187      0.99265  9.24s
     50       0.00185       0.00186      0.99254  9.50s
     51       0.00184       0.00186      0.99243  11.45s
     52       0.00183       0.00185      0.99231  11.97s
     53       0.00183       0.00184      0.99219  11.64s
     54       0.00182       0.00184      0.99210  13.53s
     55       0.00181       0.00183      0.99199  12.19s
     56       0.00181       0.00182      0.99189  11.70s
     57       0.00180       0.00182      0.99176  12.60s
     58       0.00180       0.00181      0.99165  11.37s
     59       0.00179       0.00181      0.99151  11.79s
     60       0.00179       0.00180      0.99140  11.76s
     61       0.00178       0.00180      0.99128  11.60s
     62       0.00177       0.00179      0.99114  12.20s
     63       0.00177       0.00178      0.99102  12.98s
     64       0.00176       0.00178      0.99090  13.48s
     65       0.00176       0.00177      0.99076  14.15s
     66       0.00175       0.00177      0.99062  13.75s
     67       0.00175       0.00177      0.99050  13.71s
     68       0.00174       0.00176      0.99035  13.35s
     69       0.00174       0.00176      0.99020  13.38s
     70       0.00173       0.00175      0.99007  13.47s
     71       0.00173       0.00175      0.98992  13.64s
     72       0.00173       0.00174      0.98978  13.37s
     73       0.00172       0.00174      0.98966  14.40s
     74       0.00172       0.00173      0.98952  13.91s
     75       0.00171       0.00173      0.98937  13.91s
     76       0.00171       0.00173      0.98922  13.86s
     77       0.00170       0.00172      0.98910  14.91s
     78       0.00170       0.00172      0.98898  14.39s
     79       0.00170       0.00171      0.98884  14.17s
     80       0.00169       0.00171      0.98871  14.69s
     81       0.00169       0.00171      0.98858  15.49s
     82       0.00168       0.00170      0.98847  14.70s
     83       0.00168       0.00170      0.98834  14.17s
     84       0.00168       0.00170      0.98820  13.87s
     85       0.00167       0.00169      0.98810  15.65s
     86       0.00167       0.00169      0.98796  15.96s
     87       0.00167       0.00169      0.98784  14.88s
     88       0.00166       0.00168      0.98769  15.96s
     89       0.00166       0.00168      0.98758  15.62s
     90       0.00165       0.00168      0.98744  14.07s
     91       0.00165       0.00167      0.98733  16.89s
     92       0.00165       0.00167      0.98721  16.43s
     93       0.00165       0.00167      0.98710  15.48s
     94       0.00164       0.00166      0.98697  13.08s
     95       0.00164       0.00166      0.98685  12.50s
     96       0.00164       0.00166      0.98672  13.03s
     97       0.00163       0.00165      0.98660  13.24s
     98       0.00163       0.00165      0.98649  13.79s
     99       0.00163       0.00165      0.98636  14.05s
    100       0.00162       0.00165      0.98623  13.76s
Out[52]:
NeuralNet(X_tensor_type=None,
     batch_iterator_test=<nolearn.lasagne.base.BatchIterator object at 0x11a15d6a0>,
     batch_iterator_train=<nolearn.lasagne.base.BatchIterator object at 0x11a15d630>,
     custom_score=None, hidden_num_units=300, input_shape=(None, 6750),
     layers=[('input', <class 'lasagne.layers.input.InputLayer'>), ('hidden', <class 'lasagne.layers.dense.DenseLayer'>), ('output', <class 'lasagne.layers.dense.DenseLayer'>)],
     loss=None, max_epochs=100, more_params={},
     objective=<function objective at 0x11a15f158>,
     objective_loss_function=<function squared_error at 0x119ec89d8>,
     on_epoch_finished=[<nolearn.lasagne.handlers.PrintLog object at 0x11e4f5fd0>],
     on_training_finished=[],
     on_training_started=[<nolearn.lasagne.handlers.PrintLayerInfo object at 0x11e4f5f98>],
     output_nonlinearity=None, output_num_units=102, regression=True,
     train_split=<nolearn.lasagne.base.TrainSplit object at 0x11a15d6d8>,
     update=<function nesterov_momentum at 0x119ed7268>,
     update_learning_rate=0.01, update_momentum=0.9,
     use_label_encoder=False, verbose=1,
     y_tensor_type=TensorType(float32, matrix))

Let's grab out the predictions!


In [53]:
preds = net1.predict(X_test)
preds.shape


Out[53]:
(7028, 102)

We're going to see how well the classifier did by examining the predicted class labels. One way to visualize this is to put the class labels on the X-axis and the predicted value for each class on the Y-axis, sample by sample. Here's a simple example with no frills in the matplotlib interface.


In [64]:
import matplotlib.pyplot as plt

%matplotlib inline

plt.bar(np.arange(len(preds[0])), preds[0])


Out[64]:
<Container object of 102 artists>

Alrighty, let's add some frills - the class labels, the predicted value for each class label, and the original class label.


In [68]:
### NOTE: Change the value of i to inspect any other test sample!
i = 5942
plt.figure(figsize=(20,5))
plt.bar(np.arange(len(output_lb.classes_)), preds[i])
plt.xticks(np.arange(len(output_lb.classes_)) + 0.5, output_lb.classes_, rotation='vertical')
plt.title('Original Label: ' + output_lb.inverse_transform(Y_test)[i])
plt.show()
# print(output_lb.inverse_transform(Y_test)[i])


Let's assign each sample the class label with the highest predicted value (an argmax over the outputs), and then compute the mutual information score again.


In [69]:
preds_labels = []
for i in range(preds.shape[0]):
    # Index of the highest predicted value (the argmax) for this sample.
    pos = np.argmax(preds[i])
    preds_labels.append(output_lb.classes_[pos])

mi(preds_labels, output_lb.inverse_transform(Y_test))


Out[69]:
0.69295769512199501

With a score of about 0.69, that's not bad either! It certainly didn't outperform the RandomForestClassifier, but the default parameters on the RFC were probably pretty good to begin with. Notice, too, how little tweaking of the neural network we had to do.

For good measure, these were the class labels. Notice how successful influenza has been in replicating across so many different species!


In [70]:
output_lb.classes_


Out[70]:
array(['American_Black_Duck', 'American_Green_Winged_Teal',
       'American_Widgeon', 'American_Wigeon', 'Aquatic_Bird', 'Bald_Eagle',
       'Bantam', 'Barnacle_Goose', "Bewick'S_Swan", 'Bird', 'Black_Duck',
       'Black_Headed_Gull', 'Black_Scoter', 'Blue_Winged_Teal',
       'Blue_Winged_Teal_', 'Bufflehead', 'Canada_Goose', 'Canvasback',
       'Chicken', 'Cinnamon_Teal', 'Common_Coot', 'Common_Eider',
       'Common_Goldeneye', 'Common_Murre', 'Dog', 'Domestic_Cat',
       'Double_Crested_Cormorant', 'Duck', 'Dunlin', 'Eurasian_Teal',
       'Eurasian_Wigeon', 'Ferret', 'Gadwall', 'Garganey', 'Glaucous_Gull',
       'Goose', 'Greater_White_Fronted_Goose', 'Grebe',
       'Green_Winged_Teal', 'Guineafowl', 'Gull', 'Herring_Gull',
       'Hooded_Merganser', 'Human', 'Knot', 'LAB', 'Large_Cat',
       'Laughing_Gull', 'Lesser_Scaup', 'Long_Tailed_Duck', 'Mallard',
       'Mallard_Black_Duck_Hybrid', 'Mallard_Duck', 'Mew_Gull', 'Mink',
       'Murre', 'Muscovy_Duck', 'NA', 'Northern_Pintail',
       'Northern_Shoveler', 'Ostrich', 'Pacific_Golden_Plover', 'Panda',
       'Pelican', 'Pig', 'Pigeon', 'Pink_Footed_Goose', 'Pintail',
       'Poultry', 'Quail', 'Red_Crested_Pochard', 'Red_Knot',
       'Red_Necked_Stint', 'Redhead', 'Redhead_Duck', 'Ring_Billed_Gull',
       'Ring_Necked_Duck', 'Rosy_Billed_Pochard', 'Ruddy_Turnstone',
       'Sanderling', 'Sea_Mammal', 'Semipalmated_Sandpiper', 'Shelduck',
       'Shorebird', 'Shoveler', 'Sloth_Bear', 'Snow_Goose', 'Sooty_Tern',
       'Sparrow', 'Swan', 'Swine', 'Teal', 'Tufted_Duck', 'Turkey',
       'Unknown', 'Velvet_Scoter', 'Waterfowl', 'Weasel',
       'White_Fronted_Goose', 'White_Rumped_Munia', 'Widgeon', 'Wood_Duck'], 
      dtype='<U27')

The biology behind this dataset

If you made it this far, thank you for hanging on! How does this mini project relate to the biology of flu?

As the flu evolves and moves between hosts, it gradually adapts to its new host. This allows it to successfully establish an infection in the host population.

We observe the viral host at the point of sampling. Sometimes we don't catch the virus in its adapted state, but in its un-adapted state, as if it had freshly jumped in from another host population. That is likely why some of the class labels are mis-identified.

Also, there are environmentally sampled isolates. They obviously aren't replicating in the environment itself (i.e. bodies of water); they replicate in some host and were shed into the water. For these isolates, the "Environment" label won't match any host, but their sequences should carry a stronger signal for particular hosts - whether ducks, pigs or even humans.

Next steps?

There are a few obvious things that could be done next.

  1. Latin hypercube sampling for Random Forest parameters (see the sketch after this list).
  2. Experimenting with adding more layers, tweaking the layer types, etc.
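Scikit-learn doesn't ship Latin hypercube sampling out of the box, but as a rough stand-in for point 1, here's a minimal sketch of randomized hyperparameter search over the Random Forest using RandomizedSearchCV (the parameter ranges are illustrative guesses, not tuned values):

from sklearn.grid_search import RandomizedSearchCV  # sklearn.model_selection in newer versions
from sklearn.ensemble import RandomForestClassifier

param_distributions = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 40],
    'max_features': ['sqrt', 'log2', None],
}

search = RandomizedSearchCV(RandomForestClassifier(), param_distributions,
                            n_iter=20, random_state=42)
search.fit(X_train, Y_train)
print(search.best_params_)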

What else might be done? Ping me at ericmajinglong@gmail.com with the subject "neural nets and HA". :)

